Mlcoalsim: multilocus coalescent simulations.

Coalescent theory is a powerful tool for population geneticists as well as molecular biologists interested in understanding the patterns and levels of DNA variation. Using coalescent Monte Carlo simulations it is possible to obtain the empirical distributions for a number of statistics across a wide range of evolutionary models; these distributions can be used to test evolutionary hypotheses using experimental data. The mlcoalsim application presented here (based on a version of the ms program, Hudson, 2002) adds important new features to improve methodology (uncertainty and conditional methods for mutation and recombination), models (including strong positive selection, finite sites and heterogeneity in mutation and recombination rates) and analyses (calculating a number of statistics used in population genetics and P-values for observed data). One of the most important features of mlcoalsim is the analysis of multilocus data in linked and independent regions. In summary, mlcoalsim is an integrated software application aimed at researchers interested in molecular evolution. mlcoalsim is written in ANSI C and is available at: http://www.ub.es/softevol/mlcoalsim.


Introduction
Statistical inference of molecular population data under different evolutionary models typically employs a coalescent framework (Kingman, 1982a,b;Hudson, 1990;Donnelly and Tavaré, 1995;Nordborg, 2001). Hudson's ms (Hudson, 2002) application enabled a large number of population geneticists and molecular biologists to examine data under different evolutionary models. In recent years, a number of coalescent programs focused on the generation of genetic data have been published (e.g. SimCoal, Excoffi er et al. 2000;Laval and Excoffi er, 2004;SelSim, Spencer and Coop, 2004;CoaSim, Mailund et al 2005;FastCoal, Marjoram and Wall, 2005). Nevertheless, multilocus data obtained by high throughput techniques (e.g. the Drosophila Polymorphisms Sequencing Project, as well as smaller projects such as those described by Akey et al. 2004;Schmid et al. 2005) are not easily analyzed using available software. Here we describe the mlcoalsim software application which, unlike other available tools, allows the generation of simulated genetic data and the calculation of descriptive statistics for a large number of loci under different evolutionary models, as well as obtaining P-values of observed data.

Program Overview
mlcoalsim enables researchers to compare single and multilocus data with several common evolutionary models. It is an integrated application that not only constructs coalescent trees and sequences but also calculates a number of summary statistics that are useful for the examination of evolutionary hypotheses. This program is designed to generate within-species genetic data; that is, the level of nucleotide variation should not be too high-a maximum of approximately 5%-in order to avoid important errors (a more sophisticated substitution model should be used). For the same reason, the level of divergence from an outgroup species should be no greater than 10-15%.

Multilocus analyses
One of the main features of mlcoalsim is the generation of DNA samples and calculation of a number of statistical tests for a set of multiple loci with variable levels of intragenic recombination. There are two options for multilocus analysis: using independent (unlinked) loci and using a single long region separated into several fragments. The first option (independent loci) allows the independent analysis of each locus and the calculation of summary statistics (the average and variance for all loci of each statistic). This option is useful for contrasting data with demographic models that would affect the entire genome. For such an analysis, a correction factor for population size depending on the chromosomal location of each locus (e.g. autosomal, sexual) is needed. The second option (linked loci) generates samples for an entire linked region and calculates statistics for specified fragments within this region or for a sliding window analysis. The "linked" option is useful in evolutionary processes that affect only specific regions, such as a selective sweep in a recombining region.

Uncertainty in mutation and recombination rates
Mutation and recombination rates are critical parameters which are usually unknown. In order to consider the uncertainty of these two parameters, mlcoalsim can sample the rates from a distribution (uniform and gamma distributions are used) instead of using a fixed value. In addition, mlcoalsim can generate samples by fixing the observed values, the number of segregating sites, the minimum number of recombination events and (optionally) the number of haplotypes. This last option is obtained using the rejection method 2 of Tavaré (Tavaré et al. 1997). Posterior distributions for the population mutation and for the population recombination parameter are recorded.
Heterogeneity in mutation and recombination rate across the sequence mlcoalsim is also able to take into account differences in the mutation and in recombination rates across the studied region. Heterogeneity is modelled with a gamma distribution, modelling from extreme hotspots regions (i.e. in case using heterogeneity for the recombination rate, only few position are enabled to recombine while others can not) to uniform values for all positions. Furthermore, it is possible to fi x the average number of invariant positions (position that can not mutate) for the studied region.
Evolutionary models mlcoalsim includes the following evolutionary models: the neutral stationary panmictic model, the fi nite island model, models with changing population sizes over time, refugia models and deterministic positive selection (not all of these models can be used simultaneously). mlcoalsim allows the use of neutral and positive selection models for different independent loci, and changing population size also can be used with a fi nite island model.

Statistics
A number of statistics and related tests used in population genetics are displayed in the output ( Table 1). The statistics incorporated in this program describe the level and patterns of diversity for a given sample. Different statistics that estimate the level of variation are included (θ, Watterson, 1975, π, Tajima, 1983, and θ H , Fay and Wu, 2000) for the entire sample. Although these estimates are calculated using different approaches, the values should be equal under the assumption of a neutral stationary panmictic model. The average levels of variation within and among populations are also estimated (π w and π b , Hudson et al. 1992), as well as the average differentiation among populations with the Fst statistic (e.g. Hudson et al. 1992).
A description of the patterns of diversity is obtained using two main classes of statistics (Ramos-Onsins and Rozas, 2002): Class I statistics, which use the mutation frequency information, and Class II statistics, which use information from the haplotype distribution. Class I includes Tajima's D test (TD, Tajima, 1989), Fu and Li's tests (FD*, FF*, FD, FF, Fu and Li, 1993), Fay and Wu's H test (Fay and Wu, 2000), R2 (Ramos-Onsins and Rozas, 2002) and weighted statistics for a multilocus approach such as D/Dmin (Schaeffer, 2002) and H/Hmin (Schmid et al. 2005). Class II includes the number of haplotypes Kw (Strobeck, 1987) and the haplotype diversity Hw (Depaulis and Veuille, 1998), both weighted by the number of samples for a better multilocus comparison, Fs (Fu, 1997), the statistics B and Q, (Wall, 1999) which count differences in haplotype structure at adjacent positions, the ZA statistic (Rozas et al. 2001) as a measure of linkage disequilibrium at adjacent positions, maxhap (Depaulis et al. 2003) and maxhap1 (simplifi ed from Hudson et al. 1994), which counts the number of lines with the most common haplotype (i.e. maxhap) but allowing a single segregating site within the largest "haplotype" group (Rozas et al. 2001). Finally, the minimum number of recombination events, Rm (Hudson and Kaplan, 1985), is also calculated. Multilocus analyses generate a comprehensive output with the calculated statistics with their average and variance. Only biallelic positions are considered for the analyses given that tri-or tetraallelic positions are rare in within-species samples.

Other technical features
The generation of random deviates from uniform, binomial, Poisson, and gamma distributions and the determining of roots for complex functions are based on Lanczos (1964); Atkinson (1979); Cheng and Feast (1979); Fishman (1979); Ridders (1979);  and . The Rm function was obtained and modified from Wall's code (Wall, 2000). The gamma function was partially obtained from Grassly, Adachi and Rambaut code (Grassly et al. 1997).